Data Explorer¶

In this notebook example, we'll take a look at Datumaro data exploration Python API. Specifically, we are going to provide the example codes for data exploration for image query and text query with MS-COCO 2017 dataset. Please prepare COCO 2017 validation dataset or download it referred this link COCO Prerequisite.

In [1]:
# Copyright (C) 2022 Intel Corporation
#
# SPDX-License-Identifier: MIT

import datumaro as dm
from datumaro.components.searcher import Searcher
from datumaro.components.visualizer import Visualizer

Data exploration¶

Explore with COCO instance segmentation dataset¶

To use data exploration, we need to define hash for each dataset. So, set save_hash as True. The default value is False.

In [2]:
dataset = dm.Dataset.import_from("coco_dataset", format='coco_instances')
dataset
/home/dwekr/miniconda3/envs/datum/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.1
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
WARNING:root:File '/media/hdd2/datumaro/coco_dataset/annotations/panoptic_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File '/media/hdd2/datumaro/coco_dataset/annotations/person_keypoints_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File '/media/hdd2/datumaro/coco_dataset/annotations/captions_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
Out[2]:
Dataset
	size=5000
	source_path=/media/hdd2/datumaro/coco_dataset
	media_type=<class 'datumaro.components.media.Image'>
	annotated_items_count=4952
	annotations_count=78647
subsets
	val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['bbox', 'mask', 'polygon']
infos
	categories
	label: ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']

Set searcher with dataset which is used to database.

In [3]:
searcher = Searcher(dataset)

Explore with image query¶

Set one of dataset as query which you want to find similar dataset.

In [4]:
for i, item in enumerate(dataset):
    if i==50:
        query = item

Use Visualizer to check which query is used.

In [5]:
visualizer = Visualizer(dataset, figsize=(12, 12), alpha=0)
fig = visualizer.vis_one_sample(query.id, "val2017")
fig.show()
In [6]:
topk_list = searcher.search_topk(query, topk=15)
In [7]:
subset_list = []
id_list =[]
for result in topk_list:
    subset_list.append(result.subset)
    id_list.append(result.id)
In [8]:
fig = visualizer.vis_gallery(id_list[:12], subset_list[:12], (None, None))
fig.show()

Explore with text query¶

Set text as query which you want to find similar dataset. You can set it as a sentence or a word.

In [9]:
topk_list = searcher.search_topk('elephant', topk=15)
In [10]:
subset_list = []
id_list =[]
for result in topk_list:
    subset_list.append(result.subset)
    id_list.append(result.id)
In [11]:
fig = visualizer.vis_gallery(id_list[:12], subset_list[:12], (None, None))
fig.show()